Simple combination of multiple somatic variant callers to increase accuracy

Publications comparing variant caller algorithms present discordant results with contradictory rankings. Caller performances are inconsistent and wide ranging, and dependent upon input data, application, parameter settings, and evaluation metric. With no single variant caller emerging as a superior standard, combinations or ensembles of variant callers have appeared in the literature. In this study, a whole genome somatic reference standard was used to derive principles to guide strategies for combining variant calls. Then, manually annotated variants called from the whole exome sequencing of a tumor were used to corroborate these general principles. Finally, we examined the ability of these principles to reduce noise in targeted sequencing.


Scientific Reports
| (2023) 13:8463 | https://doi.org/10.1038/s41598-023-34925-y www.nature.com/scientificreports/ of each variant caller and variant caller combination when indels were called. Figure 3 plots the F1 measure of variant caller combinations applied to SNVs, under various intersection thresholds. The intersection threshold is the minimum number of callers a variant must be called by in order to be accepted. Figure 3B,D show median and maximum F1 increasing as the number of combined callers increases. In Fig. 3A, the highest median F1 is obtained by accepting variants called by n − 1 callers. Figure 3C shows that for most caller combinations, the highest F1 measure is obtained at a threshold of n − 1. In Fig. 3A,B, data points are blue if the combination contains MuSE and, when 3 or more callers are combined, has a minimum caller Whole exome sequencing. Figure 4A plot precision, sensitivity and F1 measures, when variants were accepted when called by a minimum intersection threshold (from 7 to 1) of 7 combined SNV callers. Figure 4B,C show the effect on F1 measure of increasing the number of combined variant callers, when callers are added in either ascending or descending order by individual caller performance.
Amplicon sequencing. The noise returned by any single caller will not entirely overlap with a different caller, therefore noise should be reduced by accepting the intersections of combined callers. Counts of the remaining false positives are given in Table 2, when accepting calls made by a minimum number of callers. False positive counts were also totaled after removing the noisiest caller. However, this did not dramatically increase noise at minimum intersection thresholds n, n − 1, or n − 2.

Conclusions
In summary, the simple heuristics developed during this study for SNV calling are: 1. Multiple variant caller combination increases accuracy. 2. Accept variants called by n − 1 callers, where n is the total number of callers. That is, sensitivity is maintained by keeping the positives that are only (false) negative in a single caller. 3. Without prior knowledge of which variant callers are best suited to a dataset, accuracy may be increased by increasing the number of combined callers. 4. Removal of the worst performing caller from a combination does not necessarily increase accuracy In contrast, indel calling requires more judicious selection and combination of variant callers:

Methods
Whole genome sequencing. We aimed to combine a reasonably low number of variant callers while still covering a broad range of conceptual calling approaches by selecting callers representative of the 6 types of core variant callers algorithms identified in Xu's 2018 review of somatic variant callers 21 . The 'machine learning' core type was excluded when initial exploration suggested that variants called were overly dependent on data used for training, also termed 'batch effects' 22 . A widely-used indel caller, Pindel 4 , was added as it takes a pattern growth approach unlike any of the 6 core algorithm types described by Xu. To prevent over-fitting and to enable replication, default parameters were used and parameter fine-tuning was avoided.
In their 2016 paper "A somatic reference standard for cancer genome sequencing" 20 , Craig et al. reported high coverage (99x) whole genome sequencing of a matched metastatic melanoma cell line (COLO829) and normal by Illumina HiSeq. We aligned the FASTQ files sequenced by Illumina to human genome version hg19/GRCh37 using BWA-MEM 23 , and called variants using the seven variant callers listed in Table 1.
Variants from the resultant VCF files were retained if their 'FILTER' field entry was either "PASS" or ". " and variant allele depth was > 3, and variant allele frequency > 0.02. This low stringency filtering was intended to remove some random "stochastic noise" without also removing the "deterministic noise" (noise/false positives systematically introduced by variant callers). The threshold was chosen after examination of a density histogram plotting variant allele depth of the caller that called the most variants, Pindel. The variants called by each caller were compared to the indels or SNVs of the Craig et al. reference standard using VCFtools 24 and RTG tools 25 .  FN) and their harmonic mean, the F1 measure. This study emphasizes maximizing F1, however maximizing sensitivity and minimizing false negatives may be preferred in clinical contexts such as biopsies of low tumor content, or investigation of specific implicated genes. We did not test for caller accuracy on low or very low frequency variants. All possible combinations of variant callers were tested, and variants were accepted if called by, or by more than, a minimum intersection of callers. That is, a minimum caller intersection threshold was applied to only accept calls made by at least a minimum of the number of callers. If n is the total number of combined callers, then this minimum intersection ranged between 2 to n. Every intersection threshold was tested in every possible combination of callers.
Whole exome sequencing. The COLO829 somatic reference standard was produced from a cancer cell line, and lacks the complexity and heterogeneity of a bulk tumor sample. Therefore, we manually annotated 26  We evaluated the level of background noise by applying amplicon sequencing, using the ThunderBolts Cancer Panel (Bio-Rad), to ten germline normal samples. True germline allele frequencies can only be 0.5 (heterozygous) or 1 (homozygous). Noise variants called by each caller with frequency outside of 0.4-0.6 (heterozygous) or 0.9-1 (homozygous) were intersected. The noise returned by any single caller will not entirely overlap with a different caller, therefore by combining callers and accepting their intersections, a large amount of noise will be discarded. Counts of the remaining false positives are given in Table 2, when accepting calls made by a minimum number of callers. Somatic calls, as made in the previous WGS and WES analyses, had germline calls and shared noise subtracted, while no calls were subtracted from these germline calls. Table 2. Counts of false positives called from 10 germline samples by a combination of 5 or 4 callers, across a range of minimum intersection thresholds.